Data Assessment¶

Group 8¶

GitHub Link: https://github.com/AabritiKarki/CapstoneProject_Group8¶

Our project works with data collected through a client intake form developed by our team. Clients fill out the form by answering questions about the mental health issues they have been facing and need assistance with. We will develop models using both the data gathered by the previous group and the data collected through our own form. Because the models will label each patient according to the severity of their condition, they will spare medical professionals the time-consuming work of scanning intake forms by hand and help them prescribe the most suitable intervention.

Data Quality¶

What is the Dataset about?¶

The dataset collected through the form contains several types of information: about 6 demographic questions; basic medical questions for each of four personas (Anxiety, PTSD, Trauma, and Substance abuse), with around 17 questions per persona; and open-ended questions. Responses are collected from St. Clair students as well as CogniXR patients.

Exploring other Datasets¶

This is an external project proposed by CogniXR, for which we have developed a patient intake form. Our analysis will use the data collected by us and by CogniXR. We will also use the data collected by the previous group as the basis for the model, which we will then upgrade. Beyond these sources, we are not exploring other datasets; the project will rely solely on data collected by ourselves and CogniXR.

Library and Data Import¶

In [1]:
# Library Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
In [2]:
# Read data
df = pd.read_excel('Mental Health intake form(1-19).xlsx')
# Drop the first 5 metadata columns added by MS Forms
df = df.drop(columns=df.columns[:5])
# Preview the remaining question columns
df.head()
In [3]:
column_names = [
    #General
    'initials', 'gender', 'age', 'address', 'occupation', 'income', 'have_prev_diagnosis', 'felt_condition',
    'experienced_symptoms', 'medication_text', 'treatment_receieved_past', 'treatment_receieved_present',
    'daily_affect', 'mental_condition',
    #PTSD
    'last_symptom_ptsd', 'present_treatment_ptsd', 'therapist_support_ptsd','daily_affect_ptsd',
    'goals_ptsd', 'expectations_ptsd', 'relationship_changes_ptsd', 'sleep_impact_ptsd',
    'sleep_patterns_ptsd', 'mood_ptsd', 'mood_change_ptsd', 'general_effect_text_ptsd', 'financial_ptsd',
    'chance_ptsd', 'cope_text_ptsd', 'comfortable_treatment_ptsd', 'helpful_treatment_text_ptsd', 
    'more_concerns_ptsd',
    #Anxiety
    'last_symptom_anxiety', 'present_treatment_anxiety', 'therapist_support_anxiety', 'daily_affect_anxiety', 
    'goals_anxiety', 'expectations_anxiety', 'relationship_changes_anxiety', 'sleep_impact_anxiety',
    'sleep_patterns_anxiety', 'mood_anxiety', 'mood_change_anxiety', 'general_effect_text_anxiety', 'financial_anxiety',
    'chance_anxiety', 'cope_text_anxiety', 'comfortable_treatment_anxiety', 'helpful_treatment_text_anxiety', 
    'more_concerns_anxiety',
    #Trauma
    'last_symptom_trauma', 'present_treatment_trauma', 'therapist_support_trauma', 'daily_affect_trauma',
    'goals_trauma', 'expectations_trauma', 'relationship_changes_trauma', 'sleep_impact_trauma',
    'sleep_patterns_trauma', 'mood_trauma', 'mood_change_trauma', 'general_effect_text_trauma', 'financial_trauma',
    'chance_trauma', 'cope_text_trauma', 'comfortable_treatment_trauma', 'helpful_treatment_text_trauma', 
    'more_concerns_trauma',
    #SUD
    'history_sud', 'present_treatment_sud', 'therapist_support_sud', 'daily_affect_sud',
    'goals_sud', 'expectations_sud', 'relationship_changes_sud', 'sleep_impact_sud',
    'sleep_patterns_sud', 'mood_sud', 'mood_change_sud', 'legal_issue_sud',  'financial_sud',
    'chance_sud', 'cope_text_sud', 'comfortable_treatment_sud', 'helpful_treatment_text_sud', 
    'more_concerns_sud',
    #Closing
    'feeling_present_text', 'preferred_therapy', 'consent_agreement'
]
In [4]:
# Imports used for anonymization
import random
import string

# Preserve the mapping from short column name to original question text
column_question_map = dict(zip(column_names, df.columns))
# Rename all columns
df.columns = column_names
# Anonymize the initials column: replace each value with two random uppercase letters
letters = string.ascii_uppercase
df['initials'] = df['initials'].apply(lambda _: random.choice(letters) + random.choice(letters))
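Random two-letter initials can collide, so two respondents may end up with the same label. As a possible alternative (not what the cell above does), the sketch below replaces the initials with sequential pseudonymous IDs. The toy data and the `participant_id` column name are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the real intake responses
toy = pd.DataFrame({"initials": ["AK", "JS", "AK"], "age": [21, 34, 21]})

# Assign one unique, non-identifying ID per row (P001, P002, ...)
toy["participant_id"] = [f"P{i:03d}" for i in range(1, len(toy) + 1)]

# Drop the original initials so they are not carried into the analysis
toy = toy.drop(columns=["initials"])
print(toy)
```

Unlike random letters, sequential IDs stay unique however many responses are collected.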

EDA on data collected so far¶

Data collection started today, and this is a preliminary EDA report generated on the dataset. This week we will examine the collected data in detail, visualize it in lower dimensions with embedding techniques such as t-SNE and dimensionality reduction techniques such as PCA, and look for patterns with cluster analysis.

In [6]:
from pandas_profiling import ProfileReport
ProfileReport(df, progress_bar=False)
Out[6]:
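The planned projection and clustering steps could look roughly like the sketch below. It runs on synthetic data rather than the real responses and assumes scikit-learn is available. The answer values, the choice of three clusters, and the perplexity are placeholder assumptions, not decisions taken in this project.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic stand-in for the intake data (hypothetical values, not real responses)
rng = np.random.default_rng(0)
n = 30
toy = pd.DataFrame({
    "gender": rng.choice(["F", "M", "Other"], size=n),
    "mood_anxiety": rng.choice(["Low", "Medium", "High"], size=n),
    "sleep_impact_anxiety": rng.choice(["Yes", "No"], size=n),
    "age": rng.integers(18, 60, size=n),
})

# One-hot encode the categorical answers and standardise the numeric column
encoded = pd.get_dummies(
    toy, columns=["gender", "mood_anxiety", "sleep_impact_anxiety"]
).astype(float)
encoded["age"] = (encoded["age"] - encoded["age"].mean()) / encoded["age"].std()

# Linear projection to two dimensions
pca_2d = PCA(n_components=2, random_state=0).fit_transform(encoded)

# Non-linear embedding; perplexity must be smaller than the number of rows
tsne_2d = TSNE(
    n_components=2, perplexity=5, init="pca", random_state=0
).fit_transform(encoded)

# Cluster in the encoded space (three clusters is an arbitrary choice here)
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(encoded)

print(pca_2d.shape, tsne_2d.shape, np.bincount(cluster_labels))
```

With only a few dozen responses, the t-SNE layout is sensitive to the perplexity setting, so any apparent grouping should be compared against the PCA view before it is interpreted.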

How will the dataset be helpful?¶

The dataset we intend to collect consists of basic information about four mental health profiles: Anxiety, PTSD, Trauma, and Substance abuse. We will train a machine learning model on the data entered by patients to predict suitable mental health intervention strategies from their symptoms and concerns. This will reduce the time spent filtering and categorizing patients by their needs, so that they can get the help they need from CogniXR.

Data Fitness¶

Is the data sufficient for the use in your project?¶

In addition to the data we are collecting at St. Clair College, we will receive more data from CogniXR, which should provide enough data for our analysis and model building. We are targeting at least 50 patient responses, which we expect to be sufficient for the analysis.

Can you answer the research question using the data?¶

We will categorize the need for and type of intervention for each patient's mental health issues according to their level of severity, which will save experts the time otherwise spent assessing the forms manually. With the data collected, we expect to be able to fulfill the requirements set by the customer, CogniXR, and obtain the intended result.

Ethical Assessment¶

Data Collection:¶

For collecting data, we are taking the following factors into account.

Consent: Users will be informed about the data that is being collected. We have included a consent question in the form, so data is collected and used only with the consent of the individual filling out the form.

Bias: We have included demographic questions (age, province, occupation, gender, and income), which will contribute towards less biased data by letting us check how well the responses cover different groups.

Personally Identifiable Information: The data is anonymised (respondent initials are replaced with random letters before analysis), and no personally identifiable information about individuals is retained.

Data Storage:¶

Data Security: Data will be stored on a cloud platform and protected in accordance with industry standards under the existing License agreement.

Right to remove individual data: We have provided an email contact for individuals who wish to update or remove the data that has been already collected.

Data retention plan: The dataset will be retained by CogniXR and on GitHub, and used for future analysis and reference.

Analysis:¶

  1. Dataset Bias: Intake forms rely on the client to accurately report their symptoms and history, so responses may be subject to biases such as social desirability bias or recall bias. Clients may not report symptoms accurately due to stigma, shame, or lack of awareness, which may result in an inaccurate assessment of their mental health needs. To reduce these biases, the survey or questionnaire should be designed carefully and the sample should be representative of the population.

  2. Privacy in analysis: Depending on the nature of the data, it may be necessary to transform the variables or data structures in order to conduct meaningful analysis or visualization. This may involve recoding variables, creating new variables, or aggregating the data in a different way. It may also involve using models that have been specifically designed for interpretability, such as decision trees.

  3. Auditability: We are using a GitHub repository to track all documents. Any changes to the code or the project made by individuals will be reviewed before they are added to the final report.
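The recoding and aggregation mentioned under "Privacy in analysis" can be illustrated with a minimal pandas sketch. The toy values, the age bands, and the `age_band` and `respondents` names are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical toy responses (not real data)
toy = pd.DataFrame({
    "age": [19, 23, 31, 45, 52, 38],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "mood_anxiety": ["High", "Low", "Medium", "High", "Low", "Medium"],
})

# Recode exact age into coarse bands so individuals are harder to single out
toy["age_band"] = pd.cut(
    toy["age"],
    bins=[17, 24, 34, 44, 120],
    labels=["18-24", "25-34", "35-44", "45+"],
)

# Drop the exact age once the band has been derived
released = toy.drop(columns=["age"])

# Aggregate: report counts per band instead of row-level records
summary = (
    released.groupby("age_band", observed=True)
    .size()
    .reset_index(name="respondents")
)
print(summary)
```

Reporting counts per band rather than individual rows keeps visualizations useful while reducing the chance that a single respondent can be recognised.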

Modeling¶

We have started collecting data through the intake form, which has been confirmed by the client. For model development, we will first implement the machine learning model on the previously collected data. Once we have collected enough data through the form, we will apply the model to that data as well and improve it.

Research questions:¶

  1. How severe is the patient's condition?
  2. What kind of therapy is suitable for the patient?

Outcomes:¶

Solution implementation of the research problem:¶

We first developed a psychometric questionnaire through which we can collect data and prepare labels. We will use supervised learning models such as decision trees, SVMs, and random forests, together with unsupervised clustering methods, to prepare a data summary report that describes the patient and how urgently they require therapy. Factor analysis will also be performed. To identify urgency, we have variables that indicate the severity of the patient's condition and the type of therapy the person will require. Once the data summary report is created, it will be sent automatically to the therapist's email.
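As a rough illustration of how the supervised models named above could be compared, the sketch below cross-validates a decision tree, an SVM, and a random forest on synthetic data. The features, the three-level severity label, and all hyperparameters are placeholder assumptions. The real labels would come from the questionnaire.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 60 respondents, 8 encoded answers each (hypothetical)
rng = np.random.default_rng(42)
X = rng.integers(0, 5, size=(60, 8)).astype(float)

# Hypothetical severity label (0 = low, 1 = medium, 2 = high),
# derived from the total score purely so the example has something to learn
score = X.sum(axis=1)
y = np.digitize(score, bins=np.quantile(score, [1 / 3, 2 / 3]))

models = {
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "svm": SVC(kernel="rbf", C=1.0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Stratified folds keep the severity classes balanced across splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, accuracy in results.items():
    print(f"{name}: mean accuracy {accuracy:.2f}")
```

With a target of around 50 responses, cross-validation gives a more stable estimate than a single train/test split, although the scores will still carry considerable uncertainty.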

Future data needs and Potential Challenges:¶

The questionnaire has been updated multiple times according to the client's needs. Any future changes to the survey questions will require re-collecting the data and repeating the analysis.
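One way to notice such questionnaire changes early is to compare the questions in a new export against the ones the analysis was built on. This is a minimal sketch: `diff_questions` is a hypothetical helper, and the question texts are invented examples. In the notebook, the values of `column_question_map` could serve as the expected list.

```python
def diff_questions(expected, actual):
    """Return (removed, added) question texts, preserving their original order."""
    expected_set, actual_set = set(expected), set(actual)
    removed = [q for q in expected if q not in actual_set]
    added = [q for q in actual if q not in expected_set]
    return removed, added


# Hypothetical question lists from an old and a new version of the form
old_form = ["What is your age?", "What is your gender?", "How is your sleep?"]
new_form = ["What is your age?", "How is your sleep?", "Do you feel supported?"]

removed, added = diff_questions(old_form, new_form)
print("Removed:", removed)
print("Added:", added)
```

If either list is non-empty, the column renaming and any trained model should be reviewed before new responses are merged with the old ones.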